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Confexf. Continuum radiative-transfer simulations are necessary for the interpretation of observations of dusty astrophysical objects 

and for relating the results of magnetohydrodynamical simulations to observations. The calculations are computationally difficult, and 

simulations of objects with high optical depths in particular require considerable computational resources. 

Aims. Our aim is to show how radiative transfer calculations on adaptive three-dimensional grids can be accelerated. 

Methods. We show how the hierarchial tree structure of the model can be used in the calculations. We develop a new method for 

calculating the scattered flux that employs the grid structure to speed up the computation. We describe a novel subiteration algorithm 

that can be used to accelerate calculations with strong dust temperature self-coupling. We compute two test models, a molecular cloud 

and a circumstellar disc, and compare the accuracy and speed of the new algorithms against existing methods. 

Results. An adaptive model of the molecular cloud with fewer than 8 % of the cells in the uniform grid produces results in good 
agreement with the full resolution model. The relative root-mean-square (RMS) error of the surface brightness is £ 4 % at all 
wavelengths, and in regions of high column density the relative RMS error is only ~ 1CT 4 . Computation with the adaptive model is 
faster by a factor of ~ 5. Our new method for calculating the scattered flux is faster by a factor of about four in large models with a 
deep hierarchy structure, when images of the scattered light are computed towards several observing directions. The efficiency of the 
subiteration algorithm is highly dependent on the details of the model. In the circumstellar disc test the speed-up is a factor of two, 
but much larger gains are possible. The algorithm is expected to be most beneficial in models where a large number of small, dense 
regions are embedded in an environment of low mean density. 

Key words, radiative transfer - Methods: numerical 



1. Introduction 



Radiative transfer modelling is an indispensable tool in the in- 
terpretation of observations o f dusty astrophysical objects such 
as circumstellar discs (e.g.. lAcreman et al.l |20 10b . molecular 
cloud cores (e . g.. [Steinacker et al.ll2005l) . spiral galaxies (e.g., 
de Lo oze et al]l2012l) . and galaxy mergers (e.g., lHavward et al.l 
2011b . Solving the radiative transfer equation is numerically a 
very difficult problem, requiring considerable computational re- 
sources. It is sometimes possible to use one-dimensional or two- 
dimensional geometries, but for a realistic representation of in- 
homogenous structures such as turbulent molecular clouds, a 
fully three-dimensional (3D) model is needed. Moreover, an ac- 
curate description of the structure often requires the inclusion of 
a large variety of scales. For instance, a model of a circumstel- 
lar disc may require a resolution of ~ Rq near the star, while, 
to include the whole disc, the total extent of the model needs 
to be several hundred AU. On a uniform cartesian grid, such a 
model would comprise hundreds of billions cells. Furthermore, 
the radiative transfer problem needs to be solved at several 
wavelengths and iteratively, if the dust is hot and the model is 
not optically thin, to calculate the thermal dust emission self- 
consistently. 

To reduce computational cost, several 3D radiative-transfer 
codes have been developed with support for adaptive resolu- 
tion, i.e., the possibility of using a higher resolution in some 
parts of the model. With adaptive resolution, it is possible to use 
the finest resolution only where necessary, thereby reducing the 
number of cells in the model in some cases by many orders of 
magnitude. With cartesian grids, the most commonly used struc- 



ture has been the oct-tree (e.g., IJonssonI l2006t lAcreman et al.l 
l2010h . In an oct-tree, every model cell can be divided into eight 
subcells, which can then be div ided further. A complete ly dif- 
ferent approach was chosen by iRitzerveld & Ickel d2006), who 
dispenses with the cartesian grid and moved the photons along 
the edges of Delaunay triangles in a point cloud. 

Regardless of the method used to calculate the radiation 
field, iteration is needed in cases where the dust self-coupling 
is significant. The simplest and most commonly used method is 
the A iteration, but this suffers from very slow convergence in 
models with a high optical depth. Convergence can be improved 
with accelerated A iteration (ALI) at the cost of increased com- 
puter memory requirements and the additional computation re- 
quire d at each iteration step (ICannonll 19731: (Rvbicki & Hummer 
1199 ll) . Accelerated lambda iteratio n was reformulated for use 
with t he Monte Carlo methods in iHogerheiide & van der Takl 
(2000), where it was called an accelerated Monte Carlo method 
(see also lJuvela & Padoan|[l99 9). These methods are based on 
treating separately the part of the radiation emitted by a cell 
that is absorbed in the same cell or, in some variations of the 
method, in its immediate neighbourhood. Nevertheless, models 
with optical depths of several thousand, such as dense circum- 
stella r envelopes, can require tens of iterations even when using 
ALI (Juve For a large model, the computation of a sin- 

gle iteration can be very time-consuming, making solving the 
full problem infeasible. 

The programme described in this article is based on the 
Monte Carlo method. The main difference from other Monte 
Carlo radiative-transfer codes is the use of a hierarchial tree 
structure of nested grids, closely resembling that employed in 
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patch-based adaptive mesh refinement (AMR) hydrodynamics 
codes. Although hierarchial grids have been used in radiative 
transfer calculations before (e.g., Robitaille 2011), the method 
described here differs from the previous ones in some key as- 
pects. In the Monte Carlo simulation, the programme works 
grid-by-grid, moving to the next only after all photon packages 
in a grid have been processed instead of following one photon 
package at a time through the whole model. The most important 
new feature is the possibility of using subiterations, i.e., iterating 
separately those parts of the model that suffer from slow con- 
vergence. Although the current implementation uses the Monte 
Carlo method to compute the formal solution, the subiteration al- 
gorithm is independent of the solution method. The programme 
described he re has already been used in the study o f molecular 
cloud cores (Malinen et al.j 20TTt lJuvela et al.ll2.0lTh and galaxy 
mergers (Karl et al., in preparation). 

We describe how the Monte Carlo radiation transfer is per- 
formed on a hierarchial grid in Sect. 2, while the use of subiter- 
ations is explained in Sect. 3. In Sect. 4, we present results from 
some tests of the new method and compare them with a radiative 
transfer code that uses a regular 3D grid. Section 5 discusses pos- 
sible future extensions, and Section 6 presents the conclusions. 



2. Formal solution 

Because of dust self-heating and the long-distance radiative cou- 
plings between regions, the continuum radiative-transfer prob- 
lem is both non-linear and non-local. The full problem is usually 
solved using iterati ve methods, although som e 'immediate re- 
emission' codes (see lBiorkman & Wood 2001) avoid the explicit 
iteration process. Each step of the iteration typically requires the 
solution of a linear radiative-transfer problem (i.e., calculating 
the formal solution) with known dust thermal emission from the 
previous iteration. 

Our current implementation employs the Monte Carlo 
method for solving the linear transfer problem. The basic use 
of the Monte Carlo method in radiative transfer calculat ions, i.e., 
tracki ng photon packages, is well-established (see, e.g.. lWh itnev 
( 20 1 1 ) and references therein) and is not described here in detail. 
The part of the code that performs the Monte Carlo sampling in 
each grid is identica l to the uniform-grid rad iativ e-transfer pro - 
gramme described in Juvela & Padoa nl (l2003l) and l.Tuvelal (f2005). 
Therefore, the following discussion is limited to the parts of the 
programme involving the interaction between different grids. 



2.1. Structure of the model 

The dust density distribution is discretised on an adaptive mesh 
of nested grids. The grids form a tree structure, where each in- 
dividual grid is a node in a tree. Figure 1 shows an example of 
the hierarchy structure of a simple model. There is a single root 
grid (grid 0) that contains the whole simulation volume. Some 
parts of the root grid can belong to a subgrid that has a finer res- 
olution; these subgrids are children of the root grid and the root 
grid is their parent. The subgrids can have their own children 
(i.e., subgrids), which can have their own children, continuing 
until there is no need for further refinement. The depth of a grid 
is the length of the path joining the root grid to the grid; the depth 
of the root node is 0. A level of the hierarchy consists of all grids 
at the same depth. The vertices of a subgrid are always restricted 
to integer co-ordinates in the parent grid, so that a subgrid al- 
ways fully replaces an integer number of cells of its parent grid. 
Two children of a parent grid cannot overlap. In the subgrid, the 
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Fig. 1. Hierarchy structure of a simple model. 



linear size of a cell is a reciprocal of an integer (usually one half) 
times the size of a parent grid cell. 

The structure of the grid hierarchy is described in a hierar- 
chy file that lists the location and size of every grid. In addi- 
tion to the hierarchy file, there are one or more files for each 
grid that describe the density structure (possibly for several dust 
populations) and, for example in simulations of dusty galax- 
ies, the stellar emission. Basic simulation parameters such as 
the size of the model in physical units and the description of 
the required output maps and spectra are set in an initialisa- 
tion file. Because the programme was initially used mainly for 
running radiative transfer simulations in snapshots produced by 
the AMR (magneto)hyd rodynamics code ENZO dO'Shea et al.l 
2004; ICollins et alB o 10), we chose to use its format for the hi- 
erarchy file. We also wrote tools to convert data given on a uni- 
form grid or as a smoothed particle hydrodynamics snapshot into 
the format used by the programme. 

Before the simulation starts, a subgrid indentification array 
is constructed for every grid of the model. The array shows for 
each cell in the grid whether the cell belongs to a subgrid and, 
if it does, the number of that subgrid. It is used during the sim- 
ulation to quickly check whether a photon package has encoun- 
tered a subgrid. Because this needs to be done every time that the 
package crosses a cell, it is important that it can be done with a 
fast table lookup instead of a time-consuming search among the 
grids listed in the hierarchy file. 

2.2. The grid boundaries 

The external boundary of every grid has two arrays for storing 
photon packages: one for the packages that enter the grid from 
the outside (photons going inwards, inwards array) and one for 
the packages exiting the grid through its outer boundary (pho- 
tons going outwards, outwards array). If a package exits the grid 
via the outer boundary, it is stored in the grid's outwards array. 
If a package enters a subgrid instead, it is stored in that subgrid's 
inwards array. 

When a package reaches a grid boundary, the programme 
stores the number of photons in the package, the direction 
cosines, and the position where the package crosses the bound- 
ary. Some additional numbers are saved at the boundary when 
subiterations are used (see Sect 3.4). Because the number of pho- 
ton packages that need to be saved on a grid boundary cannot be 
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known when the simulations starts, the size of the arrays must 
be changed dynamically during the simulation. 

2.3. Solving the linear problem 

The simulation begins with the creation of photon packages that 
describe the radiation from sources inside the model volume, 
such as stars or dust thermal emission. The packages can be 
polychromatic so that a single Monte Carlo ray rep resents sev- 
eral wavelengths (|Jonssonl2006tljonsson et al.l2010l) . A package 
is followed step by step through the grid until it exits the grid in 
which it was created either by encountering its outer boundary 
or by entering one of its subgrids. The packet is stored in the ap- 
propriate boundary array - the grid's outwards array if the packet 
reaches the outer edge of the grid, and a child grid's inwards ar- 
ray if the packet encounters a subgrid. The simulation proceeds 
with the next package emitted inside the grid. After all packages 
are processed, the programme switches to the next grid and the 
same process is repeated. The tracking of a photon package in- 
side a grid is done using local grid coordinates, where the linear 
size of each cell is one unit. This allows us to use exactly the 
same routine for following the packages at all levels without the 
need to adjust for different cell sizes. The global coordinates that 
refer to the position in the root grid's coordinate system are only 
used when a package crosses a boundary. 

When the internal emission from all grids has been pro- 
cessed, the next step is to transport the packages stored on the 
grid boundaries. This is started from the root grid, whose in- 
wards array initially contains the photon packages entering the 
simulation volume from the outside. These represent the back- 
ground radiation, e.g., the interstellar radiation field. The pack- 
ages are followed until they exit the grid either through the outer 
boundary, or by entering a subgrid, and they are stored in the 
corresponding boundary array. The package is removed from the 
inwards array, and the simulation proceeds with the next pack- 
age. After all packages are processed and the inwards array is 
empty, the simulation continues with the next grid. The simu- 
lation moves inwards so that a child grid is always processed 
after its parent. Therefore, a child grid's inwards array can con- 
tain packages both from sources within the parent grid and from 
external radiation reaching the parent's inwards array. After all 
grids have been processed, there are no packages in any inwards 
array. The simulation continues with outwards arrays, starting 
from the deepest hierarchy level and moving outwards. Packages 
on the root grid's outwards boundary escape the simulation vol- 
ume and are deleted. Because new packages are started only at 
the beginning of the simulation when the radiation sources are 
processed, the total number of packages in the left simulation 
decreases during the computational The computation continues 
by alternatingly processing the hierarchy inwards and outwards, 
until there are no packages left. A detailed example of the pro- 
cess is given in Appendix A. 

We use the continuous absorption method described in lLucvl 
(fl999h . so that a package is normally removed from the sim- 
ulation only if it exits through the outer boundary of the root 
grid. However, for computational efficiency it is sometimes ad- 
vantageous to terminate packages when they have lost most of 
their photons. This is done by using a Russian roulette scheme. 
If the number of photons in a package falls below the limit n, it 
is deleted with probability p. If the packet survives, the number 
of photons in the package is multiplied by (1 - The param- 



1 This is not the case if package splitting at the grid boundaries is 
used. 



eters n and p controlling the Russian roulette can be set in the 
initialisation file. 

The programme also includes the possibility of using pack- 
age splitting to improve sampling in subgrids. Several photon 
packages are started for every package stored in the inwards ar- 
ray, with the number of photons in each divided correspondingly. 
When the package splitting is used, it is usually also necessary 
to employ a Russian roulette scheme at the outwards boundary 
to limit the number of packages that later continue to the parent 
grid, so that if a package entering the grid is divided into N parts, 
then the packages exiting the grid have only a 1 /N probability of 
surviving. 

2.4. Peel-off 

The peel-off method dYusef-Zadeh et alii 19841) is a technique for 
calculating images of the scattered light that produces higher 
signal-to-noise ratio images compared to the naive Monte Carlo 
simulation. In the peel-off method, one calculates after each scat- 
tering the fraction of photons that are scattered towards one or 
more observers and escape the model. The escaped photons are 
used to construct the images of scattered light. The use of the 
peel-off method increases the computational cost of each scat- 
tering event, but because every scattering contributes to the final 
image, the number of photon packages needed for a high qual- 
ity image is reduced significantly. The calculation of the scat- 
tered light maps is done as the final post-processing step after the 
dust temperature distribution has been solved with the radiative- 
transfer computation. 

We use a novel variation of the technique that accelerates the 
computation. Before the start of the simulation, the outer bound- 
aries of selected subgrids are divided into small tiles. The optical 
thickness is calculated from the centre of each tile to the outer 
border of the model for each observer direction, and the opti- 
cal thickness values are stored into a table. During the peel-off 
calculation, it is necessary to follow the package only until it 
meets the boundary of a subgrid for which the extinction table 
has been calculated. The total extinction is calculated by adding 
the extinction to the subgrid boundary to the precalculated value 
that was saved in the table. The use of precalculated extinction 
tables can reduce the computation time per photon package to a 
fraction of its original length at the cost of the need to perform 
more preliminary calculation at the start of the simulation. 

3. Local iterations 

3. 1 . The basic algorithm 

The radiative transfer equation can be formally written as 
J = AS +J Q 

S = f(J), {L> 

where J is the radiation field, S is the source function, and Jq is 
the radiation field due to constant sources such as stars. Operator 
A is a linear mapping from the source function to the radiation 
field and / is the function that relates the radiation absorbed by 
the dust to its thermal emission. If we assume, as is usually done 
in dust radiative-transfer simulations, that the opacity is indepen- 
dent of the dust temperature, A does not depend on J. Because 
the system of equations Q] has N ce u s Nf Kq unknowns, solving it 
with, e.g., Newton-Raphson iteration is not possible for large 
models. In particular, the matrix representing the discretised A 
operator has Nf Kq N^ ens (possibly) non-zero elements, needing 
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hundreds of terabytes of storage for a model with several mil- 
lion cells. 

The system of equations in Eq.Q]can be solved without ex- 
plicitly constructing the A operator by using the A iteration 

Jn+\ = AS „ + Jq 
Sn+l = f(Jn+l)- 

Each iteration step entails both solving a linear radiative -transfer 
problem for a given source function using, e.g., the Monte Carlo 
method, and calculating the source function from the radiation 
field using the dust model. In large models and with sophisti- 
cated dust models, both of these calculations are computation- 
ally expensive, limiting the ability to solve the radiative transfer 
equation in slowly converging models. 

In a model where individual cells are optically thick, most of 
the radiation emitted in a cell is absorbed in the same cell, and 
therefore does not contribute to the net energy transfer between 
different cells. This means that in the matrix representing the A 
operator, the entries on the main diagonal of the matrix are large 
compared to the other entrie s, leading to a very slow conver- 
gence of the basic A iteration (Rvbicki & H ummerll991l) . In the 
accelerated A iteration, the problem of slow convergence is mit- 
igated by explicitly treating the diagonal part of the A operator. 
The operator is written as A = An + Ai, where Ao is a diagonal 
matrix. The iteration is then run as 

Jn+i - Ao5„ + i + A\S n + Jo ^) 

Every step of the iteration involves the solution of a non-linear 
system of equations. However, because Ao is diagonal, the full 
system decouples into A ce ii s separate systems, each with Af req 
unknowns. Furthermore, because only the diagonal part of A is 
needed, storage requirements are much lower. Instead of a diag- 
onal Ao, it is possible to use more complex operators that better 
approximate A. This accelerates convergence, bu t requi res more 
storage and computation for each iteration step (lJuvelall2005l) . 

The optically thick, slowly converging regions may comprise 
only a small part of the simulation volume. The grid structure 
allows us to exploit this fact. Instead of solving the system of 
equations in Eqs.|2]or|3]in the whole simulation volume for each 
iteration, we use subiterations, i.e. take an iteration step only in 
the slowly converging grids. If the subiterations are done only in 
a small part of the model, they are faster than full iterations (see 
Appendix B for a detailed discussion). If the bulk of the model is 
optically thin, only a small number of iterations of the full model 
are needed in addition to the subiterations of the densest grids. 

3.2. Implementation 

The simulation begins with the computation of the radiation field 
produced by constant sources such as stars and the external ra- 
diation. The radiation field due to constant sources is saved for 
each cell in the model. Because we assume that the dust opacity 
does not depend on temperature, this calculation does not need to 
be repeated and only the dust thermal emission needs to be recal- 
culated in the subsequent iterations. In the following iterations, 
the dust emission for all cells in the grids that are included in the 
subiteration is computed using the previously calculated radia- 
tion field. Thereafter, the Monte Carlo radiative-transfer simula- 
tion is run to calculate the radiation field due to dust emission. 

The main difficulty in the implementation of the subitera- 
tion algorithm is ensuring that the total radiation field in a cell 
is always calculated using the data from the most recent subit- 
erations. Because the grids may have had a different number of 



iterations, this requires careful tracking of the radiation from dif- 
ferent grids. To enable this without requiring a very large amount 
of computer memory, we do not permit the inclusion of an arbi- 
trary set of grids in a subiteration. A subiteration must instead be 
done in a complete subtree. In particular, this means that taking 
a subiteration step with a single grid is allowed only if the grid 
does not have any children. We present a detailed description of 
the algorithm in Appendix B. 

We chose not to include the calculation of dust emission (i.e., 
evaluating /(/)) in the radiative-transfer programme. The total 
radiation field in each cell is written to a computer disc and a sep- 
arate programme is called to calculate the new source function. 
Therefore, any dust modelling code can be easily used with our 
radiative transfer program. If a dust model with several dust pop- 
ulations, stochastically heated non-local thermodynamic equi- 
librium (LTE) particles, and polyaromatic hydrocarbons (PAH) 
emission is used, the time spent calculating the dust emission 
can be much longer than that spent in the radiation transfer step. 
In th ese cases, it is often necessary to use acce leration methods 
(e.g., Uuvela & Padoanl 120031: iBaes et al.ll20TTh to calculate the 
dust emission. There is some overhead owing to the storage of 
data on the disc instead of computer memory. However, in all 
our tests the time spent in disc input/output (I/O) is less than 4 % 
of the total running time. Moreover, in simulations with a large 
model and a dense frequency grid the radiation field data can 
take more than a hundred gigabytes and it may be impossible to 
store the data in the memory. 

We also included the possibility of using ALI with a diag- 
onal approximate A operator. Whether ALI is used can be cho- 
sen separately for each grid. For instance, one can choose to use 
ALI in only a few grids with the highest optical thickness. The 
use of ALI accelerates the convergence in optically thick grids, 
thereby reducing the number of iterations and saving computer 
time. Furthermore, by accelerating the convergence ALI makes 
it easier to determine whether a grid h as converged. For the de- 
tails of implementing ALI, we refer to I Ju vela! (120051) . 

3.3. Automatic iteration 

The order in which different subtrees are processed can be de- 
fined by the user. However, this is impractical in models with a 
large number of grids, and it is better to let the programme auto- 
matically determine which parts of the model need more subit- 
erations. This is done by tracking the energy balance, i.e., the 
difference between energy absorbed and emitted by the dust, at 
the level of individual cells as well as entire grids, and by choos- 
ing the grids with the largest imbalance for the next iteration. 
The iteration process is terminated when the energy balance has 
been attained with the required accuracy. 

Monitoring the convergence in a single grid can be done by 
computing the difference in energy absorbed and emitted by the 
dust grains in each cell. Even if at a given point during the itera- 
tion, there is a perfect balance between the absorbed and emitted 
energy in every cell of the model, the iteration process as a whole 
has not necessarily converged. For instance, if the only radiation 
source in the model is within a very optically thick grid, energy 
balance for every cell in the model could be reached by iterat- 
ing only the grid containing the source. However, subiterations 
of larger subtrees are necessary to balance the transfer of energy 
between different grids. The programme tracks the energy flow 
from each grid to its parent. If the energy flux into a grid from 
its children has changed significantly (e.g., the change is an ap- 
preciable fraction of the total energy absorbed in the grid) since 
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the last iteration where the grid was included, the grid is tagged 
for the next iteration. 

The user can adjust the parameters controlling the iteration 
process, for instance by choosing whether the criterion deter- 
mining the next subtrees is the maximum imbalance between 
the absorbed and emitted energy by a cell, or a weighted aver- 
age over all cells in the grid. It is usually inefficient to include in 
the next iteration only the subtree starting from the grid with the 
largest energy imbalance. For instance, if a grid has only slightly 
smaller energy imbalance than one of its children, it is better to 
also include that grid in the next iteration. Otherwise, it would 
very likely be incorporated in a later iteration, and the earlier it- 
eration of its subtree could be at least partly redundant. The user 
can set the parameter controlling how close to the largest found 
energy imbalance the grids need to be in order to be included in 
the next iteration. 

The statistical noise inherent to the Monte Carlo method 
causes fluctuations even if the iteration has converged. In some 
cases, especially if the number of photon packages in the simu- 
lation is low, this can cause the iteration to get stuck in a con- 
verged subtree, while other parts of the model may not have yet 
converged. This may be alleviated by following the convergence. 
If there is no improvement, the iteration may have reached the 
noise limit, and the algorithm forces another choice of a rootgrid 
for the next iteration. However, it is important not to terminate 
the iteration prematurely, because for optically thick systems the 
convergence can be slow. 

3.4. Computational issues 

Although the grid-by-grid processing seems complicated com- 
pared to simply following one photon package at a time through 
the whole model, it introduces little computational overhead. In a 
model with a 4x 10 6 cells and « 300 grids, i.e., a relatively large 
number of small grids, only » 3 % of the CPU time is spent in the 
transfer of photon packages across the borders. This small over- 
head is easily balanced by the efficiency improvements achieved 
by processing a large model in smaller sections. The most impor- 
tant gain results from a more compact memory access pattern. 

In Monte Carlo radiative-transfer calculations the random 
walk nature of the path of a photon package makes it very diffi- 
cult to attain a good cache hit rate if the size of the model is sig- 
nificantly larger than the CPU cache. Because of the significant 
latency caused by a cache miss, the computational speed is much 
lower than what is theoretically possible. The cache hit rate can 
be improved to some extent with special indexing schemes such 
as the space-filling Hilbert-Peano curve, but this has not yielded 
significant gains in spee d owing to th e additional computation 
required by the method dJonss on 2006). In comparison, using the 
hierarchial grid structure and grid-by-grid processing provides 
an efficient way to improve the cache hit rate and consequently 
also the computational speed. When the radiation is processed 
in one small grid at a time, a much larger fraction of memory 
accesses are to cached memory locations. Even in a model with 
only one hierarchy level (i.e., a single uniform grid), it can be ad- 
vantageous to process the model in smaller parts. For instance, 
in a test calculation of light scattering in a molecular cloud, di- 
viding a 5 12 3 model into 512 64 3 cubes reduced the computation 
time to less than half. For an optimal performance, the grid size 
should be chosen so that the data for a single grid fits in the CPU 
L2 (or L3 if that exists) cache. For larger grid sizes, the cache 
hit rate drops quickly. On the other hand, a very large number 
of small grids should be avoided because of the increase in the 
overhead from boundary crossings. With typical cache sizes of a 



few megabytes, the optimal grid size is a few hundred thousand 
cells. 

The programme needs to allocate several arrays for every 
grid in the model. For a grid with N cells, the programme needs 
at least four separate arrays, each with N elements: the density, 
the emission, the absorbed energy, and the subgrid identifica- 
tionQ If subiterations are used, one array for absorbed energy 
is needed for each level of hierarchy above the level of the cur- 
rent grid. Furthermore, for grids where ALI is used, all arrays for 
storing the absorbed energy need to be duplicated to store the ab- 
sorptions from the same cell separately. In a model with a total 
of N cells and d levels and using 4 bytes for each array element, 
the worst case memory usage by the arrays is 4A^(3 + 2d) bytes if 
the frequencies are simulated one-by-one. For a relatively large 
model with 10 7 cells and 10 hierarchy levels this results in the 
worst case memory consumption of 920 MB. For real models, 
the memory usage is likely to be lower by a factor of at least 
five, because it is very unlikely that ALI is used in all grids and 
that a very large fraction of the cells are at the deepest hierar- 
chy levels. Therefore, memory consumption by the grid arrays 
is usually not a serious limitation. In the case of polychromatic 
radiative-transfer, arrays for several frequencies need to be kept 
in the memory simultaneously, and for / frequencies the worst 
case memory consumption is 4N[1 + f(2 + 2d)]. Memory lim- 
itations may restrict the number of frequencies that can be run 
simultaneously in large models. The grid-by-grid processing al- 
lows us to keep only the arrays belonging to the current grid in 
the computer memory. After completing a grid, the updated ar- 
rays can be written to the disc and the data for the next grid read 
into memory. As a result of the grid-by-grid processing, the num- 
ber of times that the data for any grid need to be written or read 
is low. However, disc access should be kept to a minimum, and 
it is better to keep the whole model in the memory if possible. 

In addition to the grid data, memory is needed for photon 
buffers that are used for communicating between different grids. 
The programme uses three 4 byte floats to store the position of 
the photon package, three floats for the direction of the package, 
and one float for the number of photons. If subgrid iterations 
are used, one 4 byte field is used to track the hierarchy level. 
If ALI is also used, one four byte field is needed to store the 
identity of the cell where the photon package started. A total of 
36 bytes is needed for every photon package^ In polychromatic 
Monte Carlo simulations each additional frequency requires 4 
bytes. If a very low Monte Carlo noise level is necessary, the re- 
quired number of photon packages can make the photon buffers 
too large to fit in the main memory. In that case, it is possible 
to store the photon buffers on the disc instead of the main mem- 
ory. Another option is to send the photon packages to the model 
cloud in smaller batches. There is some overhead due to the need 
to start the photon transfer multiple times, but in practice the ef- 
fect is negligible. 

4. Tests 

We have run tests to determine the accuracy of the programme, 
and the gains in computational cost that the new methods can 



2 A separate subgrid identification array is not strictly necessary as 
it is possible to encode the subgrid number in the unused parts of the 
density array. 

3 The memory usage could be easily reduced to 28 bytes per package, 
e.g., two floats would be enough to save the package direction and a 
single 4 byte field would be enough to save both the cell where the 
packet originated, and the hierarchy level. 
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provid e. The dust model used in all tests is based on iDraind 
(20030 and consists of a single dust component at the equilib- 
rium temperature with the local radiation field. We use a rela- 
tively sparse frequency grid with only 36 frequencies stretching 
from the ultraviolet to the far-infrared. Frequencies are simulated 
one-by-one without using polychromatic packages. 

4.1. Molecular cloud 

In the first test, we compare the results of a full resolution sim- 
ulation to results from a hierarchial model that has been gen- 
erated by combining cells with a low optical thickness, form- 
ing grids from the combined cells and building a hierarchy tree 
from the grids. We examine by how much the accuracy reduces 
with the adaptive model, and the speed-up that is possible. In 
the case of a uniform grid progra mme, the code is ide n tical to 
the pr ogrammme described in LFuvela & Padoan] (l2003l) : lJuvelal 
(2005). That programme has been tested against several other 
radiative-transfer codes, and the results were found to be in ex- 
cellent agreement (Baes et al., in preparation). 

The model represents a part of a molecular cloud that is 
heated by the external radiation field and the stars embed- 
ded in the cloud. The density field is from a snapshot of an 
isoth ermal magnetohydro dynamic simulation with the Stagger 
code (Pa doan et al.ll2007h . The simulation does not include self- 
gravity, and as a consequence the density contrasts in the model 
cloud are relatively low. We use as the full resolution model 
a 384 3 piece from a 1000 3 cube that was previously used in 
iLunttila et ail d2008l l2Q09h . The total number of cells in the 
model is thus 384 3 * 5.7 x 10 7 . 

The mean molecular hydrogen density is set to n(H2) = 70 
cirT 3 and the size of the grid to 2 pc. The mean V-band optical 
thickness through the cloud is only =s 0.4, but along some lines 
of sight it reaches values of more than 50. The exte rnal radiation 
is des cribed by the interstellar radiation field from Mathi s et al.l 
(Il983l) . The internal radiation sources are ten stars that are rep- 
resented by blackbodies with temperatures between 7000 K and 
9500 K and radii between 1.2 R and 2.1 R . The stars are lo- 
cated in regions of high density. 

The hierarchial model has a 48 3 root grid and three levels 
of refinement for a maximum effective resolution of 384 3 . The 
model has 162 grids that contain a total of 4.1 x 10 6 cells, which 
is lower by a factor of almost 14 than for the full resolution grid. 
The hierarchial model is built from the full grid by requiring 
that the V-band optical thickness of any combined cell cannot be 
larger than 0.05, and that the density contrast p max /Pmin between 
the cells merged into one is always less than 30. 

The model cloud has a relatively low maximum optical 
thickness and the dust temperature remains below 100 K. As a 
result, the cloud is optically thin to its own thermal dust emission 
and the dust temperature is almost fully converged after only one 
iteration. We therefore do not use subiterations in this model. We 
instead focus on the speed and accuracy of the formal solution 
algorithm in the full resolution and the hierarchial cloud models. 

We compare the results using surface brightness maps of the 
cloud computed at the full 384 2 resolution in directions par- 
allel to the coordinate axes. The surface brightness calculated 
from the full-resolution model and from the hierarchial model 
are shown in Fig. 2, and the relative difference 



between the results from the full model and from the hierarchial 
grid is shown in Fig. 3. The relative mean error 
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is small, less than 10 3 at all wavelengths. The relative root- 
mean-square (RMS) error 
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is larger, reaching values up to 0.04. 

The error corresponds to positions with a steep density gradi- 
ent, and where the hierarchial grid has a lower resolution than the 
full model. In the surface brightness maps that were calculated 
with the hierarchial model, the effect of the large cells can be 
seen as large areas of constant brightness (the large cells seen in 
projection). Their surface brightness is close to the mean bright- 
ness of the full resolution model over that region. However, 
because of the brightness gradient within the area, the surface 
brightness calculated with the hierarchial model is higher on one 
side and lower on the other than in the results from the full reso- 
lution model. These errors cancel each other almost exactly, and 
the mean error is very low. The effect can be easily seen by com- 
paring Figs. 2 and 3. Bands of large relative error (both positive 
and negative) can be seen in areas with steep surface brightness 
gradients. 

The error depends crucially on how the hierarchial grid is 
constructed. If it is necessary to model steep intensity gradi- 
ents accurately, the adaptive grid should be constructed with a 
small value of the largest allowed density contrast for merging 
the cells. On the other hand, for the calculation of the spectral 
energy distribution, this is not as important. In the present case, 
the high density regions were included in the adaptive grid with 
their full resolution because of the low allowed maximum cell 
optical thickness. As a result, the accuracy in the regions with a 
V-band optical thickness > 5 is good with a relative RMS error 

* io- 4 . 

When the same number of photon packages was used in both 
the hierarchial and the full resolution model, and no resampling 
(i.e., package splitting or Russian roulette) was done at the grid 
boundaries, the runtime of the hierarchial model was lower by a 
factor of 6. 1 . In the hierarchial model, resampling by a factor of 
four at the grid boundaries decreased the time spent in the com- 
putation of the radiation field produced by the stars by a factor of 
1.8, while the calculation of the effect of the external radiation 
was slower by a factor of 2.3. However, resampling a smaller 
number of initial packages is sufficient to ensure an adequate 
sampling of the radiation field. Without resampling, a very large 
number of photon packages would need to be sent to get a good 
sampling of the radiation field at the deepest hierarchy levels, re- 
quiring a time consuming computation. The largest cells would 
then be crossed by a large number of packages, and the sampling 
noise in them would be much lower than in the smallest cells. If 
the number of photon packages sent to the model is chosen so 
that the desired noise level is reached in the small cells, the larger 
ones are sampled with an unnecessarily high precision, wasting 
computation time. With package splitting, a similar noise level 
can be reached at all levels. 

To test how the resampling affects the noise level, we ran 
the Monte Carlo simulation of the 4 yum radiation field 20 times 
both with and without resampling, using the same relatively low 
xiumber of initial photon packages. The results were compared 
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Fig. 2. Surface brightness of the molecular cloud model at 100 jim. The left panel shows the map from the full resolution model and 
on the right we present the results of the adaptive model. 



to the reference solution, which was taken to be the mean of the 
solutions of all individual runs. We calculated the RMS differ- 
ence between the reference solution and the results of individual 
runs at each level of the hierarchy 



50) 



RMS 



\ N ce \\ s j 



£ (J(k) - J(k) ai ) 2 



(7) 



where Af ce ii s ,i is the number of cells on level i and the sum is 
taken over all cells in level i grids. The calculations were done 
separately for the external radiation and the radiation from the 
embedded stars. Table 1 shows the mean difference between the 
reference solution and the individual results from 20 runs. In 
both cases, the noise level is normalised to the maximum value 
found at any level either with or without resampling. 

The results of the test with external radiation show a signifi- 
cant improvement for the smallest grids where the noise level is 
highest. Without resampling, the noise level is higher by a factor 
of » 6, and to reach the same noise level without resampling, the 
number of packages would have to be increased by a factor of 
more than 30. Although the simulation of a single package sent 
to the model is slower when the resampling is used, the gain in 
computational speed is a factor of ten. In the case of the em- 
bedded stars, the gain is more modest. The noise levels in the 
most poorly sampled grids are similar with and without package 
splitting, and the gain is only due to the increased speed of simu- 
lating a single package. In the test case, the speed-up is less than 
a factor of two. 



4.2. Circumstellar disc 

To study the benefits of subiterations, we use a model of a flared 
circumstell ar disc around a pr otostar. We adopt the same density 
structure as iPinte et all (120091) 



p(r,z) =po(r/r ) a exp 



ho(r - r ) 



(8) 



where r = 100 AU, h Q = 10 AU, a = 2.625, and = 1.125. 
The inner and outer radii of the disc are 0. 1 AU and 400 AU. 
Parameter po is chosen so that the midplane /-band optical thick- 
ness is 5 x 10 4 . The disc is illuminated by a protostar that is rep- 
resented by a blackbody with a temperature T = 4000 K and a 
radius of two solar radii. We note that although the model density 
structure and the radiation sour ce are the same as in the bench- 
mark tests of IPinte et alJ d2009l) . the dust model is different, and 
the results cannot be directly compared. 

Although the density structure is in this case cylindrically 
symmetric, the disc is gridded into an adaptive 3D cartesian 
model with a 126 x 126 x 36 root grid, corresponding to physi- 
cal dimensions 800 x 800 x 240 AU. The model has nine levels 
of refinement, resulting in the maximum effective resolution of 
64000 x 64000 x 19200, or a linear resolution of 0.0125 AU. The 
number of cells in the model is approximately 8.5 x 10 6 . 

Because the inner edge of the circumstellar disc is only 0.1 
AU from the protostar, it is heated to a temperature of several 
hundred kelvin. Consequently, much of its thermal emission is at 
the optical and near-infrared wavelengths. The optical thickness 
is very high in the inner parts of the disc at these wavelengths, 
and the dust thermal radiation cannot escape directly. Therefore, 
the A iteration is expected to converge very slowly. Because the 
outer disc is cooler and has a lower density, it is optically thin 
to its own thermal emission and the iteration converges quickly. 
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Table 1. Effect of package splitting on the noise level in the molecular cloud model. 







RMS noise level (arbitrary units) 




External Radiation 


Internal Sources 


Hierarchy Level 


With Resampling 


No Resampling 


With Resampling No Resampling 





0.14 


0.13 


0.28 0.03 


1 


0.14 


0.25 


0.32 0.10 


2 


0.16 


0.50 


0.58 0.29 


3 


0.17 


1.00 


1.00 0.95 



Thus, the model is well-suited to testing the subiteration algo- 
rithm. 

We compare the accuracy and speed of the adaptive subiter- 
ation algorithm to those of the basic method of iterating the full 
model at every step. In this test, the automatic subiteration algo- 
rithm is used to select the grids that should be included in each 
iteration step. The criterion used for selecting the subsequent 
grids is the mass-weighted average of the difference between the 
emitted and absorbed energy, and all grids where the average is 
at least 0.25 times the maximum value found are chosen for the 
next iteration. The process is terminated after 30 steps, although 
the temperature distribution is not yet fully converged. We do 
not use ALI in this test. 

Figure 4 shows the CPU time used as a function of the iter- 
ation step by both the full iteration and the subgrid algorithm. 
For the subgrid method, the whole model is not included in any 
iteration and in eight steps only the innermost grid is included. 
Figure 5 illustrates the convergence of the temperature in the 
disc mid-plane at different distances from the central star. As the 
outer grids are not included in every iteration, the temperature 
further from the star is updated only in the iterations marked 
with crosses. The algorithm starts by iterating only the inner- 
most grids with the hottest dust, where the energy imbalance is 
initially the largest. Only in the later part of the calculation are 
the outer grids included. We note that the temperatures are those 



of individual cells, not azimuthal averages, and there is some 
Monte Carlo noise visible in the temperatures. 

Figures 6 and 7 show the temperature distribution in the xy- 
and xz-midplanes (parallel and perpendicular to the disc, respec- 
tively) at the end of the iteration. Only a small part of the model, 
corresponding to the three deepest hierarchy levels near the cen- 
tral star, is shown. The images are calculated with the subiter- 
ation algorithm, but the results from the full iterations are very 
similar. The RMS temperature difference between the results at 
the end of the calculation was 2.0 K, while the mass-weighted 
RMS difference was 1.3 K. As the final RMS temperature was 
76 K, the relative error was 0.026 (44 K and 0.028 with the mass- 
weighting). The discrepancy is largely due to Monte Carlo noise, 
because the difference between the final results of two runs with 
the subiteration algorithm was similar. Futhermore, the mean 
difference in the mean temperatures was only 0.03 K. In the 
largest grid that was not included in any subiteration, the final 
mean temperature after the full iterations is 0.14 K higher than 
in the subiteration results. If the subiteration algorithm is forced 
to run a single full iteration at the end, the bias disappears. 

In this test the speed-up achieved by using subiterations is 
approximately a factor of two. The relatively modest gain is due 
to two things. Firstly, in the model a relatively large fraction of 
the cells are at the deepest levels of the hierarchy, the innermost 
grid having approximately 10 % of the total cells. Secondly, the 
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Fig. 4. Cumulative CPU time used by the simulation of the disc 
model as a function of iteration number for the full iteration 
(dots) and subiterations (crosses). 



emission is strongly concentrated in the innermost grids. To im- 
prove sampling, the number of photon packages sent from a cell 
at each frequency is weighted by the total emission at that fre- 
quency from the cell so that each package starts with the same 
number of photons. The hot dust near the star emits much more 
energy than the cooler outer parts, and therefore most of the pho- 
ton packages are sent from the few innermost grids. In the subit- 
erations, the number of photon packages a cell sends is the same 
that it would send in a full iteration. Therefore, the total number 
of photon packages is almost the same in the subiteration of only 
the innermost grid as it is in the full iteration. The reduced time 
taken by the subiterations is in this case mostly due to the shorter 
time taken by the simulation of a package in a small subtree. Part 
of the acceleration is attributed to the reduced number of calls to 
the dust model evaluation, but with the simple dust model used 
in the test less than 5 % of the total time is spent in that part of 
the simulation in the case of full iterations. 

Figure 8 shows the image of the circumstellar disc in V-band 
as seen from 2 degrees above the plane of the disc. The central 
star is not seen directly because of the very high extinction in 
the disc, and the surface brightness is solely due to the scat- 
tered star light. Emission from the heated disc is at this wave- 
length less than 10~ 4 times the stellar emission. We examined 
the effectiveness of the accelerated peel-off algorithm by calcu- 
lating images of the scattered light and comparing the running 
times with the conventional peel-off method. When images of 
the scattered light were computed towards five observers, the 
accelerated peel-off with precalculated extinction tables for the 
four smallest grids provided a speed-up by a factor of 3.8. In 
calculations with a larger number of observing directions, the 
speed-up can be even more significant, because the peel-off cal- 
culations take a larger fraction of the computing time. For the 
molecular cloud model, the accelerated peel-off yielded a speed- 
up of only 1.7 for five observing directions. The smaller gain 
is due to the small size of that model and the small number of 
hierarchy levels. In such a case calculating the extinction to the 
closest grid border is often almost as expensive as determining 
the extinction to the outer border of the model. 
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Fig. 5. Convergence of the temperature at cells near the xy- 
midplane of the disc model. The red line shows the results from 
the subiteration algorithm, and the crosses indicate the iterations 
where each grid was updated. The results of the full iterations, 
where all grids were updated at each iteration, are shown with 
the black lines without markers. The cells are approximately 
0.625 AU, 1 .25 AU, 6 AU, and 24 AU from the central star. 
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Fig. 6. Temperature distribution in the xy-midplane of the cir- 
cumstellar disc model. The figure shows only an 8 AUx8 AU 
part at the centre of the 800 x 800 x 240 AU model. The boxes 
indicate the boundaries of subgrids. 

5. Future work 

5.1. Extending the method 

The use of subiterations does not depend on the method that is 
used to solve the linear radiative-transfer problem in the sub- 
tree. Although our programme uses the Monte Carlo method on 
cartesian 3D grids, in principle any coordinate system or method 
could be used. Some of the simplest changes use different coor- 
dinate systems in different grids. Near a small-sized radiation 
source, such as a star embedded in a molecular cloud, spherical 
coordinates are often better for describing the radiation field and 
the density structure, while the rest of the cloud can use carte- 
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Fig. 7. Temperature distribution in the xz-midplane of the circumstellar disc model. The figure shows an 8 AUxl.8 AU part at the 
centre of the 800 x 800 x 240 AU model. The boxes indicate the boundaries of subgrids. 
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Fig. 8. Circumstellar disc seen in the V-band. The surface bright- 
ness is due to the scattered light from the central star that is hid- 
den by the disc. 

sian grids. The computational cost of moving a photon package 
from a grid with the cartesian coordinate system to one that uses 
spherical coordinates or vice versa is not significantly larger than 
transforming between the cartesian coordinates of two different 
grids. 

Instead of simple package splitting, a much more sophisti- 
cated resampling can be done on the boundaries. Because the 
entire radiation field crossing a grid boundary is known before 
the next grid is processed, it is possible to use a resampling that 
is adapted to the radiation field. For instance, if a very large num- 
ber of packages are saved at the boundary, it may be enough to 
combine them and propagate a much smaller number of pack- 
ages further. A more complicated change would be to use differ- 
ent solution methods for the linear problem in some of the grids. 
Because the grids only communicate through their boundaries, 
the method used to solve the problem can be different in each 
grid. The only requirement is that the solution method must be 
able to calculate the radiation flux leaving the grid from the given 
radiation field at the grid boundaries, and the emission from the 
sources inside the grid. 

One possibility that could significantly accelerate the com- 
putation of optically thick grids is to explicitly calculate and 
store the entire A operator for cells in the grid. As noted 



above, because the storage requirements increase as 0(N^ n ), 
the method is practical only in small grids. However, it would al- 
low the system in Eqs.Q]in the grid to be solved directly and very 
quickly with Newton-Raphson iteration. The solution should, of 
course, be recalculated possibly several times, if the radiation 
field entering the grid changes. For grids that are too large for 
this method, the currently existing ALI with a diagonal operator 
can be extended to more complicated approximate operators. 

In some of the most detailed dust models, e.g., 
Comniegn e et al.l (I201 lh . the dust opacity is a function of 
temperature. This precludes using the radiation field from 
iterations of larger subtrees or from constant sources directly in 
subiterations. Even if the radiation sources stay constant, the 
radiation field can change because of the temperature-dependent 
opacity. Instead of separately storing the radiation from different 
hierarchy levels in each cell, the radiation must be saved in 
the inwards array of each grid. As in the case of constant 
opacity, separate arrays are needed for storing the radiation field 
from different hierarchy levels. When a subtree is iterated, the 
radiation from the inwards array of the subtree's root grid is 
transported into the subtree. In contrast to the previous case, 
radiation from the constant sources must be treated in the same 
way as emission from the medium. If the modifications allowing 
for the non-constant opacity were made, subiterations could also 
be used in line radiative transfer. The formal solution algorithm 
needs to be changed to take into account the line profiles and 
Doppler shifts, and the calculation of the dust emission has to 
be replaced with the solution of the rate equations. The use of 
subiterations is, however, identical to the case of dust radiative 
transfer with a temperature-dependant opacity. 

5.2. Parallelisation 

Monte Carlo radiative transfer can be easily parallelised within 
each iteration by dividing the photon packages between several 
processing cores. Each core can propagate the packages inde- 
pendently, and the results need to be combined only at the end 
to calculate the total radiation field in each cell. The programme 
is currently parallelised for shared memory architectures using 
OpenMP. The scaling is relatively good for a small number of 
cores. A speed-up by a factor of 4-6 is seen when using 8 cores, 
with the best gains being achieved for large models and when 
many photon packages are used. Similarly, calculating the dust 
emission for the next iteration using the dust model can be car- 
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ried out in parallel. Robi taillel (1201 ll) demonstrated good scaling 
to more than a thousand computational cores in Monte Carlo ra- 
diative transfer in distributed memory computers, especially if a 
large number of photon packages is needed. 

While it is possible to parallelise each iteration as described 
above, the use of subiterations provides another route to parallel 
computation. Any subtrees that do not share grids can be pro- 
cessed independently. For instance, in the hierarchy shown in 
Fig. 1, subiterations with grids 1 and 2 as the root grids can be 
calculated in parallel. In subtrees where a relatively small num- 
ber of packages is sufficient, dividing them between different 
processes does not scale well to a large number of computational 
cores because the initialisation of the processes and the merging 
of the results takes a larger fraction of the total running time. In 
these cases, better scaling can be reached by processing several 
independent subiterations in parallel, and dividing each subiter- 
ation between a small number of processes. Only subiterations 
involving large subtrees cannot be run independently. However, 
as these subtrees require a large number of photon packages, di- 
viding the photon packages between processes works well for 
these subiterations. The drawback of the method is that schedul- 
ing the iterations becomes more difficult. 

If there is enough memory at each computational node for 
the whole model cloud and the required buffers, the meth- 
ods described above can be used. As shown previously, this is 
likely to be the case in continuum radiative-transfer simulations. 
However, in line radiative-transfer computations several addi- 
tional arrays, such as the components of the gas velocity field 
and the local turbulent linewidths, are necessary and it may not 
be possible to keep the whole model in each node's memory. In 
these cases, the model can be divided between different compu- 
tational nodes according to the tree structure. Subtrees can be as- 
signed to different nodes so that they can process radiation inside 
its subtree independently. It is only necessary to communicate 
the packages entering or leaving the subtree to different nodes. 
Because the algorithm already stores the packages at the grid 
boundaries, it is relatively easy to modify it to send the packages 
to another computational node in large batches instead of send- 
ing each package individually. Nevertheless, balancing the com- 
putational load between nodes is difficult. Furthermore, while 
in our tests with a small number of CPU cores disc I/O does 
not consume a significant fraction of the total running time, in 
a larger-scale parallel computation this may not be the case. We 
discuss the details of efficient parallelisation in a forthcoming 
article. 

6. Conclusions 

We have presented new algorithms for radiative transfer on hi- 
erarchial grids. We have tested the algorithms in realistic test 
cases and compared the results with existing methods. Our main 
conclusions are: 

- The grid-by-grid processing provides some computational 
benefits owing to the higher cache hit rate. In Monte Carlo 
calculations, knowing the full radiation field at the grid 
boundary allows us to use adaptive resampling methods. 

- Results from a hierarchial model built from a uniform grid 
are close to the full resolution results, although the number 
of cells in the hierarchial model is smaller by more than an 
order of magnitude. Calculations with the hierarchial model 
are also faster than with the uniform grid. 

- Pre-calculated extinction tables can be used to accelerate the 
calculation of scattered flux. In a large model with a deep 



hierarchy structure, and when images of the scattered light 
are calculated for several observer directions, the speed-up 
can in practice be a factor of at least four. 
- Although in the circumstellar-disc test case using the subit- 
erations only provided a speed-up of a factor of two, in other 
cases the gain can be far more significant. The subiteration 
algorithm is most beneficial in cases where the model con- 
tains several small, dense regions with a high optical depth 
embedded within a medium with a much lower mean density. 
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Appendix A: Grid-by-grid processing 

As an example, we consider how the formal solution algorithm 
operates in the model depicted in Fig. 1. For the sake of sim- 
plicity, we assume that the Russian-roulette package termination 
and resampling at the grid boundaries are not used. The typical 
procedure is 

1. The emission from internal sources in grid 5 is simulated 
with photon packages. The packages are followed until they 
reach the outer border of that grid, when they are saved in 
the grid's outwards array. 

2. The internal emission is simulated similarly in other grids. 
The packages that encounter a subgrid are saved in the sub- 
grid's inwards array. For example, a package saved from grid 
3 may be saved in either the outwards array of grid 3 or the 
'inwards' array of grid 5. Grids at level 2 are simulated be- 
fore grids at level 1 , which are processed before the root grid. 
For grids on the same level, the order is arbitrary. 
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3. The packages that have reached the outer boundary of the 
root grid escape from the model volume and are removed 
from the simulation. 

4. Packages are created in the root grid's inwards array that de- 
scribe the external radiation field impinging on the model 
volume. 

5. The packages in the inwards arrays are processed, starting 
from the root grid and moving inwards (i.e., level before 
level 1, et cetera). The inwards array of a grid is cleared af- 
ter its packages have been simulated. We note that during 
the simulation of a grid, packages can only be entered into 
either the grid's outwards array or the inwards arrays of its 
child grids. Therefore, after the inwards packages for grid 
5 have been simulated, the inwards arrays of all grids are 
empty because of the top-to-bottom processing order. 

6. The packages in the outwards arrays are processed, starting 
from grid 5 and moving outwards. The packages in the root 
grid's outwards array are removed from the simulation. At 
the end of this step, all the outwards arrays are empty. 

7. If there are packages left in any of the inwards arrays, the 
algorithm returns to step 5. Otherwise, the computation is 
complete. 

Appendix B: Subiterations 

B.1. Implementation details 

When subiterations are used, we need to separately keep track of 
the radiation that enters the grid after visiting any of the higher 
levels of the grid hierarchy. Therefore, each grid has separate ar- 
rays for the radiation from the hierarchy levels above or equal 
to its own level. For example, a grid on level 2 has arrays for 
levels 0, 1, and 2. All photon packages keep track of the high- 
est level of the hierarchy tree (closest to the root, i.e., the lowest 
level number) that they have visited. The package's contribution 
to the radiation field is stored in the appropriate array according 
to that level. For instance, consider a photon package that is cre- 
ated in a grid at level 2, e.g. grid 4 in the hierarchy shown in Fig. 
1 . If the package travels to grid 5, i.e. a grid at level 3, via a grid 2 
that is at level 1, the package's contribution to the radiation field 
in grid 5 is added to the level 1 array of grid 5. The following 
subiterations start by calculating the dust emission using the lat- 
est radiation fields from previous subiterations. The total field in 
a cell is computed by summing the contributions from both the 
constant sources and the dust emission from different hierarchy 
levels 

/<level 

hot = /stars + J] /(0. (B.l) 
(=0 

where I(i) is the intensity stored in the level i array. 

Separate arrays for different hierarchy levels are needed to 
guarantee that the most recently calculated radiation fields are 
always used to compute the dust emission. As an example, 
we consider the subiterations in the hierarchy shown in Fig. 1 . 
Suppose that after computing the radiation produced by the con- 
stant sources we run subiterations with subtrees under grids 0, 4, 
2, 4, and 3 in that order. If the next subiteration is to be done with 
grid again as the root grid (i.e., we iterate the whole model), we 
need to find the radiation field in every cell of the model to com- 
pute the dust emission. Determining the radiation field in grid 
is simple: we can use the most recent field computed for that 
grid, i.e., the result of the first subiteration. Because the most re- 
cent subiterations did not include grid 0, they did not contribute 
to the radiation field there. 



In grids deeper in the hierarchy, the situation is more compli- 
cated. If we simply used the most recently calculated radiation 
field for grid 3, i.e., the result from the fifth subiteration, we 
would ignore the effect of dust emission from all grids except 3 
and 5. However, because we have earlier calculated subiterations 
with larger subtrees, we can use the results from these iterations 
for contributions from grids that were not included in the last 
subiteration. The level 1 array of grid 3 was last updated in the 
third iteration. It includes the effect of dust emission from grid 
2, the parent of grid 3 and the effect of dust emission from grid 
4 that had to enter grid 3 via grid 2. The array also includes the 
dust emission from grid 3 that exited the grid and finally scat- 
tered back. Similarly, the level array was updated in the first 
iteration and contains the contributions from grids and 1 . 

The subiterations are restricted to entire subtrees of the 
whole hierarchy to limit the amount of data that needs to be 
saved to track the contributions from different grids. If it were 
possible to iterate an arbitrary set of grids, it would be necessary 
to store for every cell in the model the radiation field that was 
emitted by each grid separately. For example, if in the model 
shown in Fig. 1, it were possible to run subiterations that include 
only grids 2 and 4, or grids 2 and 3, it would be necessary to store 
for each cell of grid 2 the effect of radiation emitted in grids 3 
and 4 separately. In a small model such as the one shown in the 
figure, this could be done. However, a model can contain more 
than a thousand grids, and in larger models the extra memory 
consumption would be prohibitive. When the iterations are re- 
stricted to entire subtrees, the memory consumption scales with 
the number of hierarchy levels instead of the number of grids. 

Restricting the subiterations to subtrees can result in a less 
efficient computation. If a grid needs to be iterated, all its de- 
scendants (i.e., children, children's children et cetera) are also in- 
cluded in the iteration. In practice, the grids at the deepest levels 
usually have the highest optical thickness and exhibit the slowest 
convergence, and it is rare that a grid at a high level in the hierar- 
chy needs more iterations than its descendants. Another possible 
source of problems is that two neighbouring grids can commu- 
nicate only in iterations that include a common ancestor grid 
(i.e., a grid whose set of descendants includes the neighbouring 
grids). Therefore, an optically thick region should be included in 
a single grid, or if it is divided between several grids, the com- 
mon ancestor of the grids should not be far above them in the 
hierarchy. If the common ancestor is at a high level in the hierar- 
chy, iterations of large subtrees are needed to communicate the 
effect of one part of the dense region to another. In some cases, 
avoiding this may not be possible without making the grids very 
large. However, we have not seen cases where this is a serious 
problem. 

B.2. A step-by-step example 

As an example of the subiteration method, we explain how the 
algorithm proceeds in the model shown in Fig. 1. We assume 
here that the subiterations are run with subtrees under grids 0, 4, 
and 2 in that order. 

1 . The radiation field due to constant sources such as stars is 
calculated and stored for each cell of every grid. 

2. Thermal dust emission is calculated using the radiation field 
from the previous step as the input. 

3. The radiative transfer simulation is run in the whole model 
(i.e., the subtree starting from grid 0) using the previously 
calculated thermal dust emission as the source. The calcu- 
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lated radiation field is stored in separate arrays according to 
the hierarchy level, as explained in the previous section. 

4. Thermal dust emission is calculated again. This time the in- 
put is the sum of the radiation from the constant sources 
(from step 1), and the thermal emission from each level from 
step 3. 

5. The radiative transfer is run in the subtree starting from grid 
4, which in this case consists of only that grid. Only the ra- 
diation field stored for grid 4 can change, because no other 
grids are included. Furthermore, only the level 2 array of grid 
4 can change. Because grids at higher levels of the hierarchy 
are not included, the photon packages cannot visit either lev- 
els or 1 in this step. 

6. Thermal emission from grid 4 is recalculated. In other grids, 
the radiation field has not changed, and it is not necessary to 
update the dust emission calculated in step 4. The input to 
the emission calculation is the sum of the radiation from the 
constant sources from step 1, the radiation in the level and 

1 arrays that was calculated in step 3, and the radiation in the 
level 2 array from step 5. 

7. The radiative transfer is run in the subtree under grid 2. This 
step updates the level 1 array in grids 2, 3, 4, and 5, the level 

2 array in grids 3, 4, and 5, and then the level 3 array in grid 
5. 

8. Dust emission from grids 2, 3, 4, and 5 is recalculated. 
B.3. Efficiency of the algorithm 

The total CPU time taken by the subiteration algorithm can be 
written as 



?sub = 2j [ f RT(<?<) + tdusligi)] ■ 



(B.2) 



where Arbiter is the number of (sub)iterations and t^rigi) and 
fdust(g;) refer to the time taken by radiative transfer and dust 
model evaluation step, respectively. The argument gi is the num- 
ber of the root grid for the subtree processed in the 2th itera- 
tion. For the special case where only iteration steps with the full 
model are taken, this yields 

tm = J] [fRT(O) + fdust(O)] = Mter [t«r(0) + fdust(O)] . (B.3) 



complicated dust model is used, the computational cost of the 
radiative transfer step can be negligible compared to the cost of 
the dust model evaluation. In that case, the total speed-up from 
the sub-iterations is also R l . 

For the radiative transfer step, the scaling with n ce n s depends 
on both the details of the model and the algorithm used for cal- 
culating the formal solution. For example, raytracing with short 
characteristics on a uniform grid and with a fixed number of di- 
rections scales linearly with the number of cells, while for long 
characteristics the scaling is n 4/ jj . 

For the Monte Carlo method, tju(gi) is the product of the 
number of photon packages sent, HpackagesfcX and the average 
time spent in the simulation of a single package, fphot(gi) 



tRT(gi) = Wpackageste^photCg;)- 



(B.6) 



The number of photon packages depends both on the details of 
the model and the Monte Carlo sampling method employed. The 
number of packages is smaller for smaller subtrees, but not nec- 
essarily significantly. For instance, in the circumstellar-disc test 
case, most of the packets originate in the deepest hierarchy lev- 
els, and Hpackagesfc) is almost the same for all subtrees. 

The average cost of tracking a single photon package is ap- 
proximately proportional to the number of cells it needs to cross 
before exiting the cloud, or before the package is otherwise ter- 
minated. In the case of a uniform grid and non-scattering (or 
optically thin) medium, the number of cells encountered scales 

1/3 

as n ce]]s . In a hierarchial model with package splitting at the grid 
boundaries, Russian roulette, and a strongly scattering medium, 
the situation is more complicated, but the calculation is always 
faster in small subtrees. 

In the circumstellar-disc test case, where the smallest grids 
contained a relatively large fraction of the total cells, R « 0.26 
and the running time was reduced by a factor of approximately 
two. In other models, the g ain from using subit erations can 
be much larger. For instance, M alinen et al.l (1201 ll) modelled a 
molecular cloud that had more than a hundred dense cores with 
embedded protostars. The grids containing the stars had together 
only 0.7 percent of the total cells in the model, but because of 
their high optical depths and hot dust they needed a large num- 
ber of iterations. Computing one full iteration took more than 22 
times longer than taking one iteration step in each of the grids 
that contained a star. 



The gain from using subiterations results from the smaller 
number of cells that need to be processed 



2 i= o biKr "cellsCg/) 



S"ce Us (0) 



R<1, 



(B.4) 



where n C eiis(g;) is the number of cells in the subtree under grid 
gi. The reason that R can be much less than one is that in full 
iterations the number of iteration steps needed is driven by the 
slowest converging regions, which often contain only a small 
fraction of the total cells in the model. With subiterations, these 
regions can be processed separately, and an optimal number of 
steps can be used for each. 

Because the dust emission is calculated independently for 
each cell in the processed subtree, the time used in the dust 
model evaluation step is always proportional to n ce \\ s (gi) 



fdustC?,) = Cn celi!i (gi), 



(B.5) 



where C is a constant that depends on the dust model. Therefore, 
the time spent in the dust model step with subiterations is re- 
duced by a factor of R l compared to the full iterations. If a 
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